127 research outputs found
Improving the Evolutionary Coding for Machine Learning Tasks
The most influential factors in the quality of the solutions
found by an evolutionary algorithm are a correct coding of the
search space and an appropriate evaluation function of the potential
solutions. This work addresses the coding of the search space for obtaining
decision rules, i.e., the representation of the individuals of
the genetic population. Two new methods for encoding discrete and
continuous attributes are presented. Our “natural coding” uses one
gene per attribute (continuous or discrete) leading to a reduction in
the search space. Genetic operators for this natural coding
are formally described and the reduction of the size of the search
space is analysed for several databases from the UCI machine learning
repository.
Comisión Interministerial de Ciencia y Tecnología TIC1143–C03–0
Fast Feature Ranking Algorithm
Attribute selection techniques for supervised learning, used in the preprocessing phase to emphasize the most relevant attributes, make classification models simpler and easier to understand. The proposed algorithm has some interesting characteristics: a lower computational cost (O(m n log n), for m attributes and n examples in the data set) than other typical algorithms, owing to the absence of distance and statistical calculations; and applicability to any labelled data set, that is to say, the data set can contain continuous and discrete variables with no need for transformation. In order to test the relevance of the new feature selection algorithm, we compare the results induced by several classifiers before and after applying the feature selection algorithms.
Fast Feature Selection by Means of Projections
Attribute selection techniques for supervised learning, used in the preprocessing phase to emphasize the most relevant attributes, make classification models simpler and easier to understand. The algorithm (SOAP: Selection of Attributes by Projection) has some interesting characteristics: a lower computational cost (O(m n log n), for m attributes and n examples in the data set) than other typical algorithms, owing to the absence of distance and statistical calculations; and applicability to any labelled data set, that is to say, the data set can contain continuous and discrete variables with no need for transformation. The performance of SOAP is analyzed in two ways: percentage of reduction and classification. SOAP has been compared to CFS [4] and ReliefF [6]. The results are generated by C4.5 before and after the application of the algorithms.
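The abstract does not state the ranking criterion, so the following is only a sketch under a stated assumption: each attribute is scored by the number of class-label changes seen when the examples are sorted along that attribute's axis. It is meant to illustrate how one sort per attribute gives an O(m n log n) cost with no distance or statistical calculations. The function names and the toy data are hypothetical.

```python
def count_label_changes(values, labels):
    """Sort examples by one attribute and count how often the class label changes.

    Assumption for this sketch: fewer changes means a more relevant attribute.
    Ties keep their input order because Python's sort is stable.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])  # O(n log n)
    return sum(1 for a, b in zip(order, order[1:]) if labels[a] != labels[b])


def rank_attributes(rows, labels):
    """Return attribute indices, fewest label changes first: O(m n log n) overall."""
    n_attrs = len(rows[0])
    scores = []
    for j in range(n_attrs):
        column = [row[j] for row in rows]
        scores.append((count_label_changes(column, labels), j))
    return [j for _, j in sorted(scores)]


# Toy data: attribute 0 separates the classes cleanly, attribute 1 interleaves them.
rows = [
    (1.0, 1.0),
    (2.0, 3.0),
    (3.0, 5.0),
    (4.0, 2.0),
    (5.0, 4.0),
    (6.0, 6.0),
]
labels = ["a", "a", "a", "b", "b", "b"]
ranking = rank_attributes(rows, labels)
```

Because only an ordering of the values is needed, the same scoring works for any column whose values are mutually comparable, numeric or discrete.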
Gene Ranking from Microarray Data for Cancer Classification : A Machine Learning Approach
Traditional gene selection methods often select the
top–ranked genes according to their individual discriminative power. We
propose to apply feature evaluation measures broadly used in the machine
learning field but not so popular in the DNA microarray field. Besides,
the application of sequential gene subset selection approaches is included.
In our study, we propose some well-known criteria (filters and wrappers)
to rank attributes, and a greedy search procedure combined with three
subset evaluation measures. Two completely different machine learning
classifiers are applied to perform the class prediction. The comparison is
performed on two well–known DNA microarray data sets. We notice that
most of the top-ranked genes appear in the list of relevant–informative
genes detected by previous studies over these data sets.
Comisión Interministerial de Ciencia y Tecnología (CICYT) TIN2004–00159
Comisión Interministerial de Ciencia y Tecnología (CICYT) TIN2004-06689C030
Biclustering on expression data: A review
Biclustering has become a popular technique for the study of gene expression data, especially for discovering functionally related gene sets under different subsets of experimental conditions. Most biclustering approaches use a measure or cost function that determines the quality of biclusters. In such cases, the development of both a suitable heuristic and a good measure for guiding the search is essential for discovering interesting biclusters in an expression matrix. Nevertheless, not all existing biclustering approaches base their search on evaluation measures for biclusters. There exists a diverse set of biclustering tools that follow different strategies and algorithmic concepts which guide the search towards meaningful results. In this paper we present an extensive survey of biclustering approaches, classifying them into two categories according to whether or not they use evaluation metrics within the search method: biclustering algorithms based on evaluation measures and non-metric-based biclustering algorithms. In both cases, they have been classified according to the type of meta-heuristic on which they are based.
Ministerio de Economía y Competitividad TIN2011-2895
Evolutionary Biclustering based on Expression Patterns
The majority of the biclustering approaches for
microarray data analysis use the Mean Squared Residue (MSR)
as the main evaluation measure for guiding the heuristic.
MSR has been proven to be ineffective at recognizing several
kinds of interesting patterns for biclusters. Transposed Virtual
Error (VEt) has recently been shown to overcome MSR
drawbacks, being able to recognize shifting and/or scaling
patterns. In this work we propose a parallel evolutionary
biclustering algorithm which uses VEt as the main part of
the fitness function, which has been designed using the volume
and overlapping as other objectives to optimize. The resulting
algorithm has been tested on both synthetic and benchmark
real data, producing satisfactory results. These results have been
compared to those of the most popular biclustering algorithm,
developed by Cheng and Church and based on the use of MSR.
Ministerio de Ciencia y Tecnología TIN2007-68084-C02-0
Measuring the Quality of Shifting and Scaling Patterns in Biclusters
The most widespread biclustering algorithms use the Mean Squared Residue (MSR) as the measure for assessing the quality of biclusters. MSR can correctly identify shifting patterns, but fails at discovering biclusters presenting scaling patterns. Virtual Error (VE) is a measure which improves on the performance of MSR in this sense, since it is effective at recognizing biclusters containing shifting patterns or scaling patterns as quality biclusters. However, VE presents some drawbacks when the biclusters present both kinds of patterns simultaneously. In this paper, we propose an improvement of VE that can be integrated in any heuristic to discover biclusters with shifting and scaling patterns simultaneously.
Ministerio de Ciencia y Tecnología TIN2007-68084-C02-0
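The behaviour of MSR described above can be checked numerically. The sketch below uses the standard Cheng and Church definition, in which the residue of an entry is the entry minus its row mean, minus its column mean, plus the overall mean, and MSR is the mean of the squared residues. The toy matrices and the function name are illustrative assumptions. VE is not implemented here.

```python
def mean_squared_residue(matrix):
    """Mean Squared Residue of a bicluster given as a list of equal-length rows."""
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    row_means = [sum(row) / n_cols for row in matrix]
    col_means = [sum(matrix[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    overall_mean = sum(row_means) / n_rows
    total = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            residue = matrix[i][j] - row_means[i] - col_means[j] + overall_mean
            total += residue ** 2
    return total / (n_rows * n_cols)


base = [1.0, 2.0, 4.0]

# Shifting pattern: every row is the base profile plus a row-specific constant.
shifting = [[v + shift for v in base] for shift in (0.0, 3.0, 5.0)]

# Scaling pattern: every row is the base profile times a row-specific constant.
scaling = [[v * scale for v in base] for scale in (1.0, 2.0, 3.0)]

msr_shifting = mean_squared_residue(shifting)  # zero up to rounding error
msr_scaling = mean_squared_residue(scaling)    # strictly positive
```

A perfect shifting bicluster scores zero (up to floating-point rounding), while an equally perfect scaling bicluster receives a positive score and so looks worse to any heuristic that minimizes MSR.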
Shifting Patterns Discovery in Microarrays with Evolutionary Algorithms
In recent years, interest in extracting useful knowledge from gene expression data has increased enormously with the development of microarray techniques. Biclustering is a recent technique that aims at extracting a subset of genes that show a similar behaviour for a subset of conditions. It is important, therefore, to measure the quality of a bicluster, and one way to do that is to check whether each data submatrix follows a specific trend, represented by a pattern. In this work, we present an evolutionary algorithm for finding significant shifting patterns which depict the general behaviour within each bicluster. The empirical results we have obtained confirm the quality of our proposal, obtaining very accurate solutions for the biclusters used.
Comisión Interministerial de Ciencia y Tecnología (CICYT) TIN2004-00159
Comisión Interministerial de Ciencia y Tecnología (CICYT) TIN2004-06689C030
Configurable Pattern-based Evolutionary Biclustering of Gene Expression Data
BACKGROUND: Biclustering algorithms for microarray data aim at discovering functionally related gene sets under different subsets of experimental conditions. Due to the problem complexity and the characteristics of microarray datasets, heuristic searches are usually used instead of exhaustive algorithms. Also, the comparison among different techniques is still a challenge. The obtained results vary in relevant features such as the number of genes or conditions, which makes it difficult to carry out a fair comparison. Moreover, existing approaches do not allow the user to specify any preferences on these properties. RESULTS: Here, we present the first biclustering algorithm in which it is possible to particularize several bicluster features in terms of different objectives. This can be done by tuning the specified features in the algorithm or by incorporating new objectives into the search. Furthermore, our approach bases the bicluster evaluation on the use of expression patterns, being able to recognize both shifting and scaling patterns, either simultaneously or not. Evolutionary computation has been chosen as the search strategy, and our proposal is thus named Evo-Bexpa (Evolutionary Biclustering based in Expression Patterns). CONCLUSIONS: We have conducted experiments on both synthetic and real datasets, demonstrating Evo-Bexpa's ability to obtain meaningful biclusters. Synthetic experiments have been designed in order to compare Evo-Bexpa's performance with other approaches when looking for perfect patterns. Experiments with four different real datasets also confirm the proper performance of our algorithm, whose results have been biologically validated through Gene Ontology.
Searching for rules to detect defective modules: A subgroup discovery approach
Data mining methods in software engineering are becoming increasingly important as they
can support several aspects of the software development life-cycle such as quality. In this
work, we present a data mining approach to induce rules extracted from static software
metrics characterising fault-prone modules. Due to the special characteristics of defect
prediction data (imbalance, inconsistency, redundancy), not all classification algorithms
are capable of dealing with this task conveniently. To deal with these problems, Subgroup
Discovery (SD) algorithms can be used to find groups of statistically different data given a
property of interest. We propose EDER-SD (Evolutionary Decision Rules for Subgroup Discovery),
an SD algorithm based on evolutionary computation that induces rules describing
only fault-prone modules. The rules are a well-known model representation that can be
easily understood and applied by project managers and quality engineers. Thus, rules
can help them to develop software systems that can be justifiably trusted. Contrary to
other approaches in SD, our algorithm has the advantage of working with continuous variables
as the conditions of the rules are defined using intervals. We describe the rules
obtained by applying our algorithm to seven publicly available datasets from the PROMISE
repository showing that they are capable of characterising subgroups of fault-prone modules.
We also compare our results with three other well-known SD algorithms, and the
EDER-SD algorithm performs well in most cases.
Ministerio de Educación y Ciencia TIN2007-68084-C02-00
Ministerio de Educación y Ciencia TIN2010-21715-C02-0
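As an illustration of rules whose conditions are intervals over continuous metrics, the sketch below evaluates one such rule against a handful of modules. It does not reproduce EDER-SD's evolutionary search or its fitness function. The metric names (`loc`, `complexity`), the interval bounds, the data, and the two reported quantities are all hypothetical choices made for this example.

```python
def covers(rule, module):
    """True if every metric named in the rule falls inside its closed interval.

    A rule maps a metric name to a (low, high) pair.
    """
    return all(low <= module[name] <= high for name, (low, high) in rule.items())


def rule_stats(rule, modules, defective):
    """Return (support, precision) of a rule describing fault-prone modules.

    support: number of modules the rule covers.
    precision: fraction of covered modules that are actually defective.
    """
    covered = [is_bad for m, is_bad in zip(modules, defective) if covers(rule, m)]
    support = len(covered)
    precision = sum(covered) / support if support else 0.0
    return support, precision


# Hypothetical static metrics for five modules and their defect labels.
modules = [
    {"loc": 120, "complexity": 14},
    {"loc": 30, "complexity": 2},
    {"loc": 200, "complexity": 25},
    {"loc": 150, "complexity": 4},
    {"loc": 90, "complexity": 11},
]
defective = [True, False, True, False, False]

# "IF 80 <= loc <= 250 AND 10 <= complexity <= 30 THEN fault-prone"
rule = {"loc": (80, 250), "complexity": (10, 30)}
support, precision = rule_stats(rule, modules, defective)
```

Keeping each condition as an interval means continuous metrics can be used directly, without discretizing them first.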